Abstract: Malware is the main threat to all computing
environments. It also acts as a launching platform for many other
cyber threats. Traditional malware detection systems are not able
to detect “modern”, “unknown” and “zero-day” malware. Recent
developments in computing hardware and machine learning
techniques have made machine learning an alternative solution for
malware detection. The effectiveness of any machine learning
algorithm depends on the features extracted from the dataset.
Various types of features have been extracted and studied with
machine learning approaches to detect malware targeting different
computing environments. In this work we organize and summarize
the different feature types used to detect malware. This work
will guide future researchers and industry in selecting feature
types for a chosen computing environment when building an
accurate malware classifier.

I. Introduction

Malware is a computer program intentionally written to harm
computing resources. Malware comes in different types based on
structural and behavioral differences, such as virus, worm,
Trojan, bot, spyware, adware, rootkit, bootkit, ransomware,
etc. The number of variants of known malware and of new malware
is increasing year by year [1], posing a threat to digital
infrastructure.

Malware is the main threat to all four kinds of computing
environments: 1) Personal computing; 2) Mobile computing; 3)
Embedded computing; and 4) Industrial control system (ICS)
computing. Although a large percentage of all malware targets
the first two environments, i.e. personal and mobile computing,
the recent past has seen a major surge in malware targeting the
other two environments as well, i.e. embedded and ICS computing.
To tackle malware threats, personal and mobile computing have
some traditional solutions such as signature-based and
heuristic-based anti-malware, but embedded and ICS computing are
wide open to malware attack. Traditional solutions are not
effective in detecting malware in any of these computing
environments because they have inherent limitations [2]-[4].

Signature-based techniques are the backbone of these
traditional anti-malware solutions, yet they are unable to work
against “modern”, “unknown” and “zero-day” malware.
Signature-based techniques work in two phases: 1) signature
creation; and 2) signature matching. Signature creation is a
multi-step process involving malware collection, malware
analysis, signature generation and signature distribution. All
of these steps are carried out with the help of both humans and
machines. Human involvement makes the process costly and slow,
which provides a large attack window to malware. Machines work
on the principle of generalization, which misses artifacts of
“modern”, “unknown” and “zero-day” malware. Signature matching
is also a multi-step process, consisting of file scanning,
signature look-up and alerting the user. Its performance depends
on the previous phase, i.e. signature creation, because it can
only match signatures that are in the database, and so
“zero-day” and “unknown” malware will escape detection. Apart
from these technical bottlenecks, signature-based techniques
also suffer from other drawbacks such as a costly analysis
process, high computation and memory requirements at the end
host, and the need for regular signature updates. These
bottlenecks and drawbacks have created a need for an alternative
anti-malware solution, and machine learning based techniques are
emerging to fulfill it.
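
The dependence of signature matching on signature creation can be illustrated with a minimal sketch. It assumes, purely for illustration, that a signature is a SHA-256 hash of the whole file; real products use richer signatures, and the function names and sample bytes below are hypothetical.

```python
import hashlib

def make_signature(sample: bytes) -> str:
    """Signature creation, reduced here to hashing the whole sample (an assumption)."""
    return hashlib.sha256(sample).hexdigest()

def scan(sample: bytes, signature_db: set) -> bool:
    """Signature matching: report a hit only if the signature is already in the database."""
    return make_signature(sample) in signature_db

# Hypothetical byte strings standing in for real files.
known_malware = b"MZ-payload-version-1"
new_variant = b"MZ-payload-version-2"  # slightly modified, never analyzed before

signature_db = {make_signature(known_malware)}

print(scan(known_malware, signature_db))  # True
print(scan(new_variant, signature_db))    # False
```

The second call shows the limitation discussed above: a sample whose signature was never created cannot be matched, however similar it is to a known one.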

Machine learning techniques treat malware detection as a
binary or multi-class classification problem, similar to
problems in many other domains. Machine learning based malware
detection has two phases: 1) training; and 2) classification.
Training is a sequential, multi-step process. The steps involved
in building a malware classifier are: sample collection, sample
labeling, feature extraction, feature selection and model
building. Sample collection is the process of collecting malware
and benign programs. It is followed by sample labeling, which
assigns the true class label (malware or benign) to each sample.
Labeled samples are ready for the next step, feature extraction,
which is a very important step of the overall machine learning
process. The extracted features largely decide classifier
performance, so a classifier performs differently with different
types of features. This decisive role of features has attracted
many engineering methods for extracting types of features that
have more discriminative value than others. Feature selection
deals with the excess of extracted features and helps to retain
only a few useful features on the basis of their discriminative
rank. Model building is the last step of training: it takes the
selected features and runs a machine learning algorithm, which
outputs a model able to classify a new input sample. During the
classification phase each sample goes through feature extraction
again, but now only those features on which the model was built
are extracted. The built model takes these extracted features as
input and outputs the probable class label for each sample.
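
The training and classification phases can be sketched end to end as follows. This is only an illustrative sketch: the ranking criterion (information gain), the model (a Bernoulli naive Bayes with Laplace smoothing) and the toy samples are assumptions made for the example and are not prescribed by any of the surveyed works.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(samples, labels, feature):
    """Entropy reduction obtained by splitting the samples on one Boolean feature."""
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for x, y in zip(samples, labels) if (feature in x) == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def select_features(samples, labels, k):
    """Feature selection: keep the k features with the highest discriminative rank."""
    vocabulary = sorted(set().union(*samples))
    ranked = sorted(vocabulary, key=lambda f: information_gain(samples, labels, f),
                    reverse=True)
    return ranked[:k]

def train(samples, labels, selected):
    """Model building: class priors and smoothed per-class feature presence rates."""
    model = {}
    for cls in set(labels):
        rows = [x for x, y in zip(samples, labels) if y == cls]
        prior = len(rows) / len(samples)
        present = {f: (sum(f in x for x in rows) + 1) / (len(rows) + 2) for f in selected}
        model[cls] = (prior, present)
    return model

def classify(model, sample):
    """Classification: only the selected features are looked at in the new sample."""
    def log_score(cls):
        prior, present = model[cls]
        score = math.log(prior)
        for f, p in present.items():
            score += math.log(p if f in sample else 1 - p)
        return score
    return max(model, key=log_score)

# Hypothetical labeled samples, each already reduced to a set of extracted features.
samples = [
    {"CreateRemoteThread", "WriteProcessMemory", "ReadFile"},
    {"CreateRemoteThread", "RegSetValue", "ReadFile"},
    {"WriteProcessMemory", "CreateRemoteThread"},
    {"ReadFile", "CreateWindow"},
    {"ReadFile", "CreateWindow", "RegSetValue"},
    {"CreateWindow"},
]
labels = ["malware", "malware", "malware", "benign", "benign", "benign"]

selected = select_features(samples, labels, k=2)
model = train(samples, labels, selected)
print(classify(model, {"CreateRemoteThread", "ReadFile", "Sleep"}))  # malware
```

Note that the unseen sample contains a feature ("Sleep") that was never selected; it is simply ignored, which mirrors the statement that only the features the model was built on are extracted at classification time.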

Machine learning based malware detection is suitable for
detecting “modern”, “unknown” and “zero-day” malware because it
is not a per-sample technique, as signature-based techniques
are. Instead, it learns to separate malware from benign programs
based on features extracted from the training dataset, and it is
able to generalize to unseen samples. The feature type plays an
important and decisive role in accurate malware detection using
machine learning. Much of the research on malware detection
using machine learning focuses on different feature types, which
are extracted by various methods and affect classifier
performance.

Over the years, many feature types for malware detection have
been proposed and evaluated, spread across all computing
environments. In this work, we organize and summarize the
different feature types used for various file types in different
computing environments. Because computing architecture,
supported file types and analysis methods vary among computing
environments, feature types vary across these environments as
well. Classifier performance depends largely on feature types
and feature selection; hence a well organized survey of feature
types will ease decision making for the entities involved, such
as future researchers and industry developers.

II. Methods 

Feature types can be grouped primarily according to
computing environment and, within each computing environment,
organized according to analysis type. In this section the four
computing environments are explained, followed by each analysis
type.

A. Computing Environments 

A computing environment is a term used to describe a complete
computing platform comprising hardware, an operating system
(OS), and other software. Computing environments differ in these
components. Each computing environment is explained below.

1) Personal computing: Personal computing refers to the
use of desktop and laptop computers in normal day-to-day life
and in various enterprises to automate tasks. Distinguishing
characteristics of the personal computing environment are a
larger display, large internal and external memory, a desktop
OS, and an attached keyboard and mouse.


2) Mobile computing: Mobile computing refers to all
devices that are mobile in nature, have a smaller screen than
personal computing devices, and are limited by a small battery.
All such devices run a specific mobile operating system or a
customized operating system. Android, iOS and Windows are the
three leading mobile OSs.

3) Embedded computing: Embedded computing refers to
all smart digital devices that have configuration options and
run a specialized operating system designed for such embedded
systems. The OSs running in smart cars, smart homes, modern
refrigerators and many other modern digital appliances are
examples of embedded computing.

4) ICS computing: Industrial control system (ICS)
computing refers to devices that run a specialized OS and
software to control and monitor industrial systems such as
digital power and water distribution systems, nuclear plants,
etc.

B. Analysis and Feature Type 

Feature extraction involves two types of analysis, which are
carried out in different ways and yield features of varying
discriminative value. Static and dynamic analysis are the two
types of malware analysis techniques, and they provide three
different types of features: 1) Static features; 2) Dynamic
features; and 3) Hybrid features. Each of the three feature
types is explained below.

1) Static features: Static analysis is a method of malware
analysis in which samples are analyzed statically, i.e. without
executing them; only structural and physical properties are
analyzed. Static features are those extracted by this static
analysis method. Static analysis is safe and fast because
samples are not executed: the analysis platform is not affected,
so many samples can be analyzed without cleaning the analysis
environment. Static features are easy to extract because static
analysis does not require a complex execution and monitoring
process.

2) Dynamic features: Dynamic analysis is the process of
executing the sample, monitoring the analysis environment and
recording the changes made during execution. Dynamic features
are those extracted from the recorded changes. Dynamic analysis
is time-consuming and complex, but it overcomes many of the
limitations of static analysis; for example, it makes it
possible to extract features from packed and obfuscated malware
samples, which static analysis cannot handle.

3) Hybrid features: Hybrid features are a combination of
static and dynamic features. Integrating static and dynamic
features enriches the discriminative power of the feature set
and improves the performance of the malware classifier. Although
this is very beneficial, it is also very costly in terms of
analysis time and computing resources.

III. Features for Personal Computing

Personal computing has a large user base because of its use
in various domains. Windows is the leading OS with respect to
number of users. This large number of users attracts attackers,
which results in a huge amount of malware targeting Windows
users alone. Similarly, detection solutions are also centered on
Windows malware. In this section the different features used to
build malware classifiers are listed and explained.

A. Strings 

Every executable or other file has "strings" in its
source, which have been used as features for building malware
classifiers. All strings present in the source files are
extracted by static or dynamic methods and used as features
following a text classification approach. Static methods for
string extraction are explained in [5], [6], while strings
collected from a "runtime trace" (by dynamic analysis) are
explained and used in [7].
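
Static string extraction can be sketched as follows. The sketch assumes that a "string" is a run of at least four printable ASCII characters, as in the common Unix `strings` utility; the cited works may use other definitions, and the sample bytes are hypothetical.

```python
import re
from collections import Counter

def extract_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII characters of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]

# Hypothetical file content: binary noise with two embedded strings.
sample = b"\x00\x01MZ\x90\x00kernel32.dll\x00\x02http://example.test/a\x00ab\x00"

strings = extract_strings(sample)
print(strings)  # ['kernel32.dll', 'http://example.test/a']

# Text-classification style features: how often each string occurs in the sample.
features = Counter(strings)
```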

B. DLL & API Call 

DLLs and API calls are also used as features for malware
detection. Both can be extracted by static and dynamic analysis
methods. DLLs and API calls are used as Boolean features: all
DLLs and API calls are extracted from the malware and benign
classes and taken as features. The presence or absence of a DLL
or API call in a sample is encoded as ‘1’ or ‘0’ respectively
[8], [9].
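
The Boolean encoding described above can be sketched as follows, assuming that the DLL and API names of each sample have already been extracted. The import sets are hypothetical examples.

```python
def build_vocabulary(samples):
    """Collect every DLL and API name seen in any sample of either class."""
    return sorted(set().union(*samples))

def to_boolean_vector(imports, vocabulary):
    """Encode presence as 1 and absence as 0, one position per vocabulary entry."""
    return [1 if name in imports else 0 for name in vocabulary]

# Hypothetical import sets of one malware sample and one benign sample.
malware_imports = {"kernel32.dll", "CreateRemoteThread", "WriteProcessMemory"}
benign_imports = {"kernel32.dll", "user32.dll", "CreateWindowExW"}

vocabulary = build_vocabulary([malware_imports, benign_imports])
print(vocabulary)
print(to_boolean_vector(malware_imports, vocabulary))  # [1, 0, 1, 1, 0]
print(to_boolean_vector(benign_imports, vocabulary))   # [0, 1, 0, 1, 1]
```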

C. Byte-n-grams 

Byte-n-grams use the frequency of "n" consecutive bytes in the
hexadecimal representation of a given sample as features.
Byte-n-grams are obtained by static analysis in two steps: 1)
converting the sample to its hexadecimal representation, and 2)
processing it to extract byte-n-grams. Byte-n-gram based feature
sets are frequently used to build malware classifiers [5],
[10]-[14].
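
The two steps can be sketched as follows. The choice of n = 2 and the sample bytes are assumptions made for the example.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every run of n consecutive bytes, written in hexadecimal."""
    # Step 1: hexadecimal representation, one two-digit token per byte.
    hex_bytes = [f"{b:02x}" for b in data]
    # Step 2: slide a window of n tokens over the sequence and count each n-gram.
    return Counter(" ".join(hex_bytes[i:i + n]) for i in range(len(hex_bytes) - n + 1))

sample = bytes.fromhex("4d5a90004d5a")  # hypothetical six-byte sample
grams = byte_ngrams(sample, n=2)
print(grams["4d 5a"])  # 2
print(len(grams))      # 4
```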

D. Opcode-n-grams

Opcode-n-grams use the frequency of "n" consecutive opcodes
in the assembly representation of a given sample as features.
Opcode-n-grams are obtained by static analysis in two steps: 1)
disassembling the sample, and 2) processing the disassembly to
extract opcode-n-grams. Opcode-n-gram based features have two
variants: one considers operands along with opcodes and the
other does not take operands into consideration [12],
[15]-[20].
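
The two variants can be sketched as follows. The sketch assumes the sample has already been disassembled into one instruction per line with the mnemonic as the first word; the listing is a hypothetical example and no particular disassembler is implied.

```python
from collections import Counter

def opcode_ngrams(disassembly, n=2, with_operands=False):
    """Count n consecutive instructions, with or without their operands."""
    tokens = []
    for line in disassembly:
        line = line.strip()
        if not line:
            continue
        # Keep the whole instruction, or only its mnemonic (the first word).
        tokens.append(line if with_operands else line.split()[0])
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical disassembly listing.
listing = ["push ebp", "mov ebp, esp", "push ebx", "mov ebx, eax", "pop ebx"]

without_operands = opcode_ngrams(listing, n=2)
with_operands = opcode_ngrams(listing, n=2, with_operands=True)
print(without_operands[("push", "mov")])  # 2
print(len(with_operands))                 # 4
```

Dropping operands merges instructions that differ only in registers or addresses, so the same listing yields fewer distinct n-grams with higher counts.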

E. PE Header Fields 

The values of PE header fields are also used as a feature set
for building malware classifiers. This feature is not applicable
to all file types; it is limited to files in the PE format, such
as DLL, EXE, etc. DOS_HEADER, FILE_HEADER and OPTIONAL_HEADER are
the three main headers from which field values are extracted and
used as features. Various approaches exist to utilize these
values, but all of them are carried out by static analysis.
Classification performance varies with the approach [8], [9],
[21]-[27].
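
Reading a few header fields can be sketched with the standard `struct` module as follows. The field offsets follow the published PE format; the choice of fields, the function name and the synthetic header used for the demonstration are assumptions made for the example, and the surveyed works use many more fields.

```python
import struct

def pe_header_features(data: bytes) -> dict:
    """Extract a few FILE_HEADER and OPTIONAL_HEADER fields from a PE image."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ magic in DOS_HEADER")
    # e_lfanew (offset 0x3C of DOS_HEADER) points at the "PE\0\0" signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")
    # FILE_HEADER is the 20 bytes that follow the signature.
    (machine, n_sections, timestamp, _symtab, _nsyms,
     opt_size, characteristics) = struct.unpack_from("<HHIIIHH", data, e_lfanew + 4)
    features = {
        "Machine": machine,
        "NumberOfSections": n_sections,
        "TimeDateStamp": timestamp,
        "SizeOfOptionalHeader": opt_size,
        "Characteristics": characteristics,
    }
    # OPTIONAL_HEADER starts with a 2-byte magic (0x10B = PE32, 0x20B = PE32+).
    if opt_size >= 2:
        (features["Magic"],) = struct.unpack_from("<H", data, e_lfanew + 24)
    return features

# Synthetic minimal header built only to exercise the parser (not a real program).
header = bytearray(0x80 + 26)
header[0:2] = b"MZ"
struct.pack_into("<I", header, 0x3C, 0x80)
header[0x80:0x84] = b"PE\x00\x00"
struct.pack_into("<HHIIIHH", header, 0x84, 0x14C, 3, 0, 0, 0, 0xE0, 0x0102)
struct.pack_into("<H", header, 0x98, 0x10B)

features = pe_header_features(bytes(header))
print(features["NumberOfSections"])  # 3
```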


F. Network and Host Activity 

Dynamic analysis provides a way to monitor and extract
dynamic behaviors such as network and host activities. During
analysis, every interaction with the network and the host is
monitored and recorded. From the recorded files various kinds of
features are extracted and used for building machine learning
based malware classifiers [28]-[30].

G. Image Properties 

Image properties are used as features in the image
classification domain, and converting binary files to images
makes it possible to tap this potential for malware
classification. With this motivation, Nataraj et al. converted
binary files to grayscale images and then extracted GIST
features from the images to build a malware classification
system [31].
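
The conversion step can be sketched as follows, treating each byte as one grayscale pixel. The fixed row width and the zero padding of the last row are assumptions made for the example, and the subsequent GIST feature extraction is not shown.

```python
def bytes_to_grayscale(data: bytes, width: int = 4) -> list:
    """Arrange the bytes of a file as rows of pixel intensities in the range 0..255."""
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row += [0] * (width - len(row))  # zero-pad the final, incomplete row
        rows.append(row)
    return rows

image = bytes_to_grayscale(bytes(range(6)), width=4)
print(image)  # [[0, 1, 2, 3], [4, 5, 0, 0]]
```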

H. Hardware Features

Hardware activity lies at the bottom of any program execution,
so monitoring it gives an accurate representation of program
behavior. Wang et al. [32] used features based on hardware
interaction to build a malware detection system.

IV. Features for Mobile Computing 

Android is the leading operating system for mobile computing,
spread across devices from smartphones to tablets. Due to its
large user base, Android is a main target of security attacks,
and malware is one of the major threats to Android. Many
research works have taken up this challenge and proposed
solutions to keep Android safe from malware attacks. Due to the
limitations of signature-based detection, most current works
focus on machine learning based Android malware detection. In
this section, the different features devised and used to build
Android malware detection systems are listed, each with a short
description and the relevant works.

A. Permissions and Intents 

Permissions and intents are crucial for any Android
application; they determine the app's functionality and
behavior. Using permissions and intents as features has resulted
in high classification performance for Android malware. These
two are mostly used as Boolean features, but a few works have
treated them as numeric features [33]-[41]. The presence or
absence of permissions and intents is taken as a Boolean
feature, whereas a weight assigned to each permission is used as
a numeric feature.
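
Both encodings can be sketched as follows. The sketch assumes the application's manifest has already been decoded to plain XML (inside an apk it is stored in a binary form); the manifest text, the permission vocabulary and the weights are hypothetical values chosen for the example.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def extract_permissions(manifest_xml: str) -> set:
    """Names of all permissions requested through <uses-permission> elements."""
    root = ET.fromstring(manifest_xml.strip())
    return {el.get(ANDROID_NS + "name") for el in root.iter("uses-permission")}

def extract_intent_actions(manifest_xml: str) -> set:
    """Names of all intent actions declared in <intent-filter> elements."""
    root = ET.fromstring(manifest_xml.strip())
    return {el.get(ANDROID_NS + "name") for el in root.iter("action")}

def boolean_features(present: set, vocabulary: list) -> list:
    return [1 if name in present else 0 for name in vocabulary]

def numeric_features(present: set, weights: dict) -> list:
    return [weights[name] if name in present else 0.0 for name in sorted(weights)]

# Hypothetical decoded manifest.
MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.app">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.INTERNET"/>
  <application>
    <receiver android:name=".Boot">
      <intent-filter>
        <action android:name="android.intent.action.BOOT_COMPLETED"/>
      </intent-filter>
    </receiver>
  </application>
</manifest>
"""

# Hypothetical per-permission weights; the vocabulary is their sorted names.
weights = {
    "android.permission.INTERNET": 0.1,
    "android.permission.READ_CONTACTS": 0.6,
    "android.permission.SEND_SMS": 0.9,
}
vocabulary = sorted(weights)

permissions = extract_permissions(MANIFEST)
actions = extract_intent_actions(MANIFEST)
print(boolean_features(permissions, vocabulary))  # [1, 0, 1]
print(numeric_features(permissions, weights))     # [0.1, 0.0, 0.9]
```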

B. Strings 

Similar to desktop files, Android applications also contain
different types of strings to accomplish various tasks. By
extracting and using these strings, a feature set can be built
to classify Android applications as malware or benign [42].

C. System Calls 

System calls are the bridge between user space and kernel
space; a user event results in one or more system calls. By
recording the system call patterns of malware and benign
applications, a Boolean feature set can be created to build an
Android malware classification system [34], [43]-[47].

D. Image Properties 

Features extracted from images, such as SIFT, GIST and HOG,
have demonstrated high classification performance in the
computer vision domain. To utilize these features for Android
malware classification, an “apk to image” conversion is used as
a pre-processing step, which converts an apk file to an image
according to a specified color map. From the resulting image the
aforementioned features are extracted and used for building an
Android malware classifier [48].

E. Network and Host Activities 

Similar to personal computing, a mobile device's network and
host activities can be recorded while a sample is executing.
Such a dynamic trace provides a better representation of an
application's behavior and hence yields better classification
accuracy with machine learning models [49]-[56].


TABLE I. Various Features with their Analysis Type

Features                       Analysis Type
-----------------------------  ----------------
Personal computing
  Strings                      Static & Dynamic
  DLL & API call               Static
  Byte-n-grams                 Static
  Opcode-n-grams               Static
  PE header fields             Static
  Network and Host activities  Dynamic
  Image properties             Static
  Hardware features            Dynamic
Mobile computing
  Permissions and Intents      Static
  Strings                      Static
  System calls                 Static & Dynamic
  Image properties             Static
  Network and Host activities  Dynamic


V. Features for Embedded Computing 

Embedded computing is becoming popular due to improved
hardware and advances in software. Circuit-based instruction
methods are being replaced with software alternatives, which
provide more functionality and flexibility (automated cars,
digital appliances, etc.). This migration also brings the
threats associated with software, such as malware. In the recent
past, a few attacks on embedded systems have been reported, but
because of the small user base the malware problem is not
receiving attention. In the future, as the user base and the
number of malware attacks grow, this serious issue will be
addressed.

VI. Features for ICS Computing 

An industrial control system (ICS) comprises the computing and
network infrastructure for monitoring, automating and
controlling an industrial system, for example a nuclear power
plant, a water or electricity distribution and control network,
etc. The effect of attacking and damaging such infrastructure
would be very dangerous: not only would financial loss occur,
but many lives could be lost. Stuxnet [57] is one such malware,
which was written and released to target SCADA systems installed
in nuclear plants. Work has started on securing ICS systems, but
the use of machine learning is very limited. Most works focus on
patching the known weaknesses of computer networks and systems.
Solutions targeting malware are very limited, but future work
will benefit from machine learning based solutions for securing
the ICS computing environment.

VII. Conclusions and Future Works 

In this work, we have summarized the different features
used with machine learning to detect malware in various
computing environments. This work provides the state of the art
of machine learning based malware detection.

In future work, an empirical study will be conducted to
validate the detection rate of the various features, and we will
try to identify the most effective and efficient features for
detecting malware in each computing environment.